Assignment: This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
#Load train data
library(curl)    # provides curl_download()
library(readxl)  # provides read_excel()
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentData.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_beverage <- read_excel(temp_file)
#Copy data
beverage_df <- data_beverage
#Clean temp file
unlink(temp_file)
#Load test data
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentEvaluation.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_eval <- read_excel(temp_file)
#Copy data
eval_df <- data_eval
Beverage dataset: 2571 rows x 33 columns.
Evaluation dataset: 267 rows x 33 columns.
Mostly numeric, with one categorical variable (Brand Code).
NAs:
Brand Code: ~4.7% missing in beverage_df, ~3.0% missing in
eval_df.
Many numeric features have a small percentage of missing values,
generally < 2%.
PH (the target variable): 4 NAs in beverage_df; eval_df has all PH values
missing (these are the values to predict).
Impute missing values for numeric features using median or mean (or more
advanced imputation if correlated features exist). Missing values should
be handled within a robust pipeline to avoid information loss.
Brand Code can be imputed or left as-is, depending on the percentage
missing per brand.
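As a baseline, the median imputation mentioned above can be sketched as follows; the pipeline later in this report uses kNN imputation from the VIM package instead, so this helper is only illustrative.

```r
library(dplyr)

# Illustrative helper (not part of the final pipeline): replace NAs in every
# numeric column with that column's median.
impute_median <- function(df) {
  df %>%
    mutate(across(where(is.numeric),
                  ~ coalesce(.x, median(.x, na.rm = TRUE))))
}

impute_median(data.frame(x = c(1, NA, 3)))  # NA becomes 2
```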
Skewness and outliers
Variables like Mnf Flow have extreme values (mean is heavily influenced
by outliers). Range: -100.2 to 229.4.
PH: Check for extreme pH values that may affect model accuracy.
Hyd Pressure1–4: High standard deviations (e.g., Hyd Pressure2 with a
mean of ~21 and SD of 16.4).
Analyze the distribution of these variables using histograms or
boxplots. Consider winsorization or a Box-Cox/log transformation for
skewed distributions.
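A minimal winsorization sketch: the 1st/99th-percentile caps below are our illustrative defaults, not cutoffs taken from the data, and would need tuning per variable.

```r
# Cap a numeric vector at chosen lower/upper quantiles (winsorization);
# the 1%/99% defaults are illustrative only.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[[1]]), bounds[[2]])
}

# Example: tame the extreme Mnf Flow values before modeling
# beverage_df$`Mnf Flow` <- winsorize(beverage_df$`Mnf Flow`)
```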
Feature importance for PH prediction
Carb Volume, Fill Ounces, PC Volume, Carb Pressure, and Carb Temp have
small sd and are likely controlled manufacturing parameters. These might
directly influence pH.
Brand Code can be treated as a categorical predictor for brand-specific
variations.
Use correlation or feature-importance analysis to identify which
variables most strongly influence PH.
Brand Code: 4 levels (A, B, C, D).
Unbalanced distribution: B accounts for ~50% of records, while A, C, and
D are much smaller. The imbalance might affect models like decision
trees or ensemble methods.
Apply stratified sampling or weighting to handle imbalance during
training. Explore interaction effects between Brand Code and numeric
variables.
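A brand-stratified split can be sketched with caret's `createDataPartition()`, which samples within each level of a factor; the toy vector below (with brand B at ~50%) stands in for the real Brand Code column.

```r
library(caret)

set.seed(100)
# Toy stand-in for Brand Code, mirroring the imbalance described above
brand <- factor(rep(c("A", "B", "C", "D"), times = c(10, 50, 20, 20)))
idx <- createDataPartition(brand, p = 0.7, list = FALSE)
prop.table(table(brand[idx]))  # shares stay close to the full-data shares
```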
Multicollinearity
Variables such as Carb Volume, PC Volume, and Carb Pressure might be
correlated due to their role in carbonation.
Multiple pressure-related variables (Carb Pressure1, Fill Pressure,
etc.) and filler speed/level metrics could also have collinear
relationships.
Compute a correlation matrix to detect highly correlated
predictors.
Principal Component Analysis (PCA) or Variance Inflation Factor (VIF) to
handle multicollinearity.
Data Leakage
Variables like PSC, PSC Fill, and PSC CO could be downstream measures
dependent on pH. Confirm whether these are part of the production
process or outcome metrics.
Analyze the production process to ensure no data leakage into the
model.
eval_df
All PH values are missing for prediction. Maybe remove this column for
now.
Structure and missingness are similar to beverage_df. Ensure
preprocessing and feature engineering pipelines are consistent between
training and evaluation datasets.
Feature engineering
Ratios: Carb Volume / PC Volume (efficiency metrics)
Differences: Carb Pressure - Fill Pressure (pressure loss)
One-hot encoding for Brand Code
Binning continuous variables (e.g., temperature ranges)
Assuming there are no timestamps, there is no need to add seasonality or
shift-based features.
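The ratio and difference features above can be sketched on a toy slice of data; the column names follow the dataset, but the values and the exact definitions are our illustrative choices.

```r
library(dplyr)

# Toy rows with dataset-style column names (values are made up)
toy <- tibble(
  `Carb Volume` = c(5.3, 5.4), `PC Volume` = c(0.25, 0.27),
  `Carb Pressure` = c(68, 64), `Fill Pressure` = c(46, 45)
)

toy_fe <- toy %>%
  mutate(
    carb_efficiency = `Carb Volume` / `PC Volume`,       # efficiency ratio
    pressure_loss   = `Carb Pressure` - `Fill Pressure`  # pressure drop
  )
```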
Modeling Considerations
Use feature scaling: variables like Carb Flow and Filler Speed have very
different ranges and should be normalized or scaled for models like SVMs
or neural networks.
If PH is skewed, consider a log or Box-Cox transform for models sensitive
to the distribution (e.g., linear regression).
#Check first rows of data
DT::datatable(
beverage_df[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 1: First 10 Rows of Beverage Data'
))
DT::datatable(
eval_df[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 2: First 10 Rows of Evaluation Data'
))
#Summary statistics
DT::datatable(
dfSummary(beverage_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 3: Summary Statistics for Beverage Data'
))
#Summary statistics
DT::datatable(
dfSummary(eval_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 4: Summary Statistics for Evaluation Data'
))
stat <- beverage_df %>%
group_by(`Brand Code`) %>%
filter(!is.na(`Brand Code`)) %>%
dplyr::summarize(
Min = min(PH, na.rm = TRUE),
Q1 = quantile(PH, 0.25, na.rm = TRUE),
Median = median(PH, na.rm = TRUE),
Mean = mean(PH, na.rm = TRUE),
Q3 = quantile(PH, 0.75, na.rm = TRUE),
Max = max(PH, na.rm = TRUE),
StdDev = sd(PH, na.rm = TRUE),
Count = n(),
Missing = sum(is.na(PH)) # Add the count of NA values
)
#Summary statistics by code
DT::datatable(
stat,
options = list(dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE,
paging=FALSE,
info = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 5: Summary Statistics for Each Brand Code'
)) %>%
DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)
The distribution of PH is roughly normal (approximately Gaussian),
centered around 8.5, with some outliers at the lower and upper tails.
Brand B has significantly more entries than the other brands; consider
balancing or stratified sampling during model training.
PH distribution across Brand Codes:
Brand C has the lowest median PH, while Brand B has the highest.
Outliers are present in all brand codes. Investigate whether these
outliers are measurement errors or valid extreme cases.
ggplot(beverage_df, aes(x = PH)) +
geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Distribution of PH", x = "PH", y = "Frequency")
ggplot(beverage_df, aes(x = `Brand Code`)) +
geom_bar(fill = "lightgreen") +
theme_minimal() +
labs(title = "Count of Entries by Brand Code", x = "Brand Code", y = "Count")
ggplot(beverage_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH")
#ggplot(beverage_df, aes(x = `Carb Pressure`, y = PH)) +
# geom_point(alpha = 0.6) +
# theme_minimal() +
# labs(title = "Scatterplot: Carb Pressure vs PH", x = "Carb Pressure", y = "PH")
MFR, Filler Speed, Hyd Pressure3, and Pressure Setpoint have
outliers.
PC Volume and PSC CO2 show categorical-like distributions.
Mnf Flow has a long-tailed distribution.
Filler Speed, Hyd Pressure2, and Usage Cont have sharp peaks.
Hyd Pressure4 has a wide spread, offering insight into process
variability.
MFR and Oxygen Filler are heavily right-skewed.
Apply a log or Box-Cox transformation to variables like MFR and Usage
Cont to normalize them.
Group categorical-like numeric variables (e.g., PSC CO2, Carb Volume) to
improve interpretability.
numeric_vars <- beverage_df %>%
dplyr::select(where(is.numeric)) %>%
names()
#3 groups
group_1 <- numeric_vars[1:10]
group_2 <- numeric_vars[11:20]
group_3 <- numeric_vars[21:32]
#Group 1 plot
beverage_df %>%
dplyr::select(all_of(group_1)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 1")
# Group 2 Plot
beverage_df %>%
dplyr::select(all_of(group_2)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 2")
# Group 3 Plot
beverage_df %>%
dplyr::select(all_of(group_3)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 3")
The first two principal components explain approximately 20.16% and
18.74% of the variance, suggesting some room for dimensionality
reduction.
Variables close to each other (e.g., Balling, Balling Lvl, Density, and
Carb Rel) are strongly correlated. Dim1 is driven by Balling, Balling
Lvl, Density, and Carb Rel; Dim2 by Hyd Pressure3, Carb Pressure1,
Filler Speed, and PH. Use these groupings for modeling.
Filler Level, Bowl Setpoint, and PH are grouped together, so they may
jointly influence the PH outcome.
Mnf Flow and Hyd Pressure3 are distant from the rest, indicating unique
variance.
Variables near the origin contribute little and could be dropped.
Combine strongly correlated variables (e.g., Carb Rel, Carb Flow,
Density) or reduce them using PCA components, clustering, or factor
analysis.
#PCA
pca <- PCA(beverage_df %>% dplyr::select(where(is.numeric)) %>% na.omit(), scale.unit = TRUE)
MFR has the highest percentage of missing values, around 7%. It is
either difficult to collect or not consistently relevant across all
rows.
Use mean, median, or model-based imputation if MFR is critical; if it
does not strongly correlate with the target or other features, consider
dropping it.
For features with <3% missingness (PSC CO2, PC Volume, etc.), simple
imputation techniques suffice.
For Brand Code, use mode imputation.
#NA
gg_miss_var(beverage_df, show_pct = TRUE)
#Outliers
boxplot(beverage_df$PH, main = "Boxplot of PH", horizontal = TRUE)
numeric_vars <- beverage_df %>%
dplyr::select(where(is.numeric)) %>%
dplyr::select(-PH) %>%
names()
for (var in numeric_vars) {
# the .data pronoun replaces the deprecated aes_string()
p <- ggplot(beverage_df, aes(x = .data[[var]], y = PH)) +
geom_point(alpha = 0.6, color = "blue") +
theme_minimal() +
labs(title = paste("Scatterplot:", var, "vs PH"), x = var, y = "PH")
print(p) # Display the plot
}
Bowl Setpoint (0.36) and Filler Level (0.35) show the strongest
positive correlations with pH. These features might be significant
predictors in any pH prediction model.
Carb Flow (0.23) and Pressure Vacuum (0.22) also have moderate positive
correlations with pH.
Additional variables such as Carb Rel (0.20) and Alch Rel (0.17) show
smaller but noteworthy correlations.
Usage Cont (-0.36) and Mnf Flow (-0.46) have strong negative
correlations with pH, indicating these might inversely impact pH
levels.
Other variables like Pressure Setpoint (-0.31) and Fill Pressure (-0.32)
also negatively correlate with pH.
Many variables such as Carb Volume, PSC, and PSC Fill have very weak
correlations (close to 0), suggesting they might not be influential in
predicting pH.
Density, Balling, and Carb Rel show strong intercorrelations among themselves, which may cause multicollinearity issues in linear regression models; consider PCA or feature selection.
Mnf Flow and Usage Cont have strong negative correlations with PH and may need additional preprocessing or transformation.
Interaction terms between Bowl Setpoint and Filler Level or between Pressure Vacuum and Carb Flow could also be explored, given their individual relationships with pH.
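The interaction idea can be sketched with `lm()`'s `*` operator; the data below is synthetic and only demonstrates the syntax (the real terms would use Bowl Setpoint, Filler Level, etc.).

```r
set.seed(1)
# Synthetic predictors standing in for, e.g., Bowl Setpoint and Filler Level
d <- data.frame(bowl = runif(200), fill = runif(200))
d$ph <- 8.5 + 0.3 * d$bowl * d$fill + rnorm(200, sd = 0.05)

# a * b expands to a + b + a:b, so the product term is fit alongside
# both main effects
m_int <- lm(ph ~ bowl * fill, data = d)
coef(m_int)["bowl:fill"]  # interaction estimate
```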
tst <- beverage_df %>%
dplyr::select(where(is.numeric))
#Correlation with PH
cor_table <- cor(drop_na(tst))[, "PH"] %>%
as.data.frame() %>%
rownames_to_column(var = "Variable") %>%
rename(Coefficient = ".") %>%
arrange(desc(Coefficient))
kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
add_header_above(c("Table: Correlation with PH" = 2))
| Variable | Coefficient |
|---|---|
| PH | 1.0000000 |
| Bowl Setpoint | 0.3615875 |
| Filler Level | 0.3520440 |
| Carb Flow | 0.2335937 |
| Pressure Vacuum | 0.2197355 |
| Carb Rel | 0.1960515 |
| Alch Rel | 0.1666822 |
| Oxygen Filler | 0.1644854 |
| Balling Lvl | 0.1093712 |
| PC Volume | 0.0988667 |
| Density | 0.0955469 |
| Balling | 0.0767002 |
| Carb Pressure | 0.0762134 |
| Carb Volume | 0.0721325 |
| Carb Temp | 0.0322794 |
| Air Pressurer | -0.0079972 |
| PSC Fill | -0.0238098 |
| Filler Speed | -0.0408830 |
| MFR | -0.0451965 |
| Hyd Pressure1 | -0.0470664 |
| PSC | -0.0698730 |
| PSC CO2 | -0.0852599 |
| Fill Ounces | -0.1183359 |
| Carb Pressure1 | -0.1187642 |
| Hyd Pressure4 | -0.1714340 |
| Temperature | -0.1826596 |
| Hyd Pressure2 | -0.2226600 |
| Hyd Pressure3 | -0.2681018 |
| Pressure Setpoint | -0.3116639 |
| Fill Pressure | -0.3165145 |
| Usage cont | -0.3576120 |
| Mnf Flow | -0.4592313 |
#Corr matrix
rcore <- rcorr(as.matrix(tst))
coeff <- rcore$r
#Filter to include only |r| > 0.1, the rest are 0
filtered_coeff <- coeff
filtered_coeff[abs(filtered_coeff) < 0.1] <- 0
corrplot(filtered_coeff, tl.cex = .6, tl.col="black", method = 'color', addCoef.col = "black",
type="upper", order="hclust", number.cex=0.45, diag=FALSE)
#Heatmap maybe
#numeric_cols <- beverage_df %>%
# select(where(is.numeric)) %>%
# na.omit()
#correlation_matrix <- cor(numeric_cols)
#corrplot::corrplot(correlation_matrix, method = "color", type = "upper", diag = FALSE)
clean_bev <- beverage_df
nearzero <- nearZeroVar(clean_bev, saveMetrics= TRUE)
nearzero[nearzero$nzv,][1:5,] %>% drop_na()
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31.11111 9.529366 FALSE TRUE
MFR shows 7% missing values, so we'll remove this feature. Hyd Pressure1 shows near-zero variance, so we can remove it as well.
# Remove 2 fields
cleantrain <- clean_bev %>%
dplyr::select(-c(MFR, `Hyd Pressure1`))
# remove the fields from our evaluation data
cleaneval <- eval_df %>%
dplyr::select(-c(MFR, `Hyd Pressure1`))
cleaneval<- cleaneval%>%
dplyr::select(-PH)
set.seed(100)
# drop rows with missing PH
cleantrain <- cleantrain %>%
filter(!is.na(PH))
# Change Brand Code missing to 'Unknown' in our training data
brandcode <- cleantrain %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
cleantrain$`Brand Code` <- brandcode$`Brand Code`
# Change Brand Code missing to 'Unknown'
brandcodeev <- cleaneval %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
cleaneval$`Brand Code` <- brandcodeev$`Brand Code`
# change an unseen data as well
codes <- unique(cleantrain$`Brand Code`)
cleaneval <- cleaneval %>%
mutate(`Brand Code` = if_else(`Brand Code` %in% codes, `Brand Code`, 'Unknown'))
# Use the kNN imputing method from VIM package to impute missing values in our training data
cleantrain <- cleantrain %>%
kNN(k=10) %>%
dplyr::select(colnames(cleantrain))
# Use the kNN imputing method from VIM package to impute missing values in our evaluation data
cleaneval<- cleaneval %>%
kNN(k=10) %>%
dplyr::select(colnames(cleaneval))
Add dummy variables for Brand Code
cleantrain_dummy <- dummyVars(PH ~ `Brand Code`, data = cleantrain)
dummies <- predict(cleantrain_dummy, cleantrain)
dummy_cols <- sort(colnames(dummies))
# columns are sorted
dummies <- as_tibble(dummies) %>%  # as.tibble() is deprecated
dplyr::select(all_of(dummy_cols))
# remove the original categorical feature
cleantrain <- cleantrain %>%
dplyr::select(-`Brand Code`)
# add the new dummy columns to train
cleantrain <- cbind(dummies, cleantrain)
# -----
# Evaluation data - Convert our `Brand Code` column into a set of dummy variables
cleaneval$PH <- 1  # placeholder so the dummyVars formula can be applied
eval_dummies <- predict(cleantrain_dummy, cleaneval)
# if not found add 0
for (c in dummy_cols) {
if (!(c %in% colnames(eval_dummies))) {
eval_dummies[c] <- 0
}
}
dummy_cols <- sort(colnames(eval_dummies))
eval_dummies <- as_tibble(eval_dummies) %>%  # as.tibble() is deprecated
dplyr::select(all_of(dummy_cols))
# remove the original categorical feature
cleaneval <- cleaneval %>%
dplyr::select(-c(`Brand Code`, PH))
# add the new dummy columns to our main eval dataframe
cleaneval <- cbind(eval_dummies, cleaneval)
library(caret)
preProcValues2 <- preProcess(cleantrain, method = "BoxCox")
train_bc <- predict(preProcValues2, cleantrain)
# Apply the SAME transformation to the evaluation data rather than
# re-estimating Box-Cox parameters there:
# eval_bc <- predict(preProcValues2, cleaneval)
#Prepare data for ggplot
gatherdf <- train_bc %>%
dplyr::select(-c(PH)) %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gatherdf) +
geom_histogram(aes(x=value, y = after_stat(density)), bins=50) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=5)
set.seed(100)
# utilizing one dataset for all four models
set <- createDataPartition(train_bc$PH, p=0.7, list=FALSE)
train <- train_bc[set,]
test <- train_bc[-set,]
set.seed(100)
#install.packages("doParallel")
library(doParallel)
library(MASS)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(200)
model1_full_lm <- lm(PH ~ ., train )
# model1_full_lm - (using stepAIC)
model1 <- stepAIC(model1_full_lm , direction = "both",
scope = list(upper = model1_full_lm , lower = ~ 1),
scale = 0, trace = FALSE)
stopCluster(cl)
# Display Model 1 Summary
(lms <- summary(model1))
##
## Call:
## lm(formula = PH ~ `\`Brand Code\`B` + `\`Brand Code\`C` + `\`Brand Code\`D` +
## `Carb Volume` + `Fill Ounces` + `PC Volume` + `Carb Temp` +
## `PSC CO2` + `Mnf Flow` + `Carb Pressure1` + `Hyd Pressure2` +
## `Hyd Pressure3` + `Filler Level` + Temperature + `Usage cont` +
## `Carb Flow` + Density + Balling + `Pressure Vacuum` + `Oxygen Filler` +
## `Bowl Setpoint` + `Pressure Setpoint` + `Air Pressurer` +
## `Carb Rel` + `Balling Lvl`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6481 -0.6695 0.0674 0.7153 4.0027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.172e+04 3.515e+04 2.325 0.020186 *
## `\\`Brand Code\\`B` 5.062e-01 1.185e-01 4.273 2.03e-05 ***
## `\\`Brand Code\\`C` -6.651e-01 1.270e-01 -5.237 1.82e-07 ***
## `\\`Brand Code\\`D` 5.351e-01 1.064e-01 5.030 5.41e-07 ***
## `Carb Volume` -1.231e+02 7.335e+01 -1.679 0.093386 .
## `Fill Ounces` -2.481e-02 1.308e-02 -1.896 0.058153 .
## `PC Volume` -6.680e-01 3.433e-01 -1.945 0.051875 .
## `Carb Temp` 3.954e+03 2.584e+03 1.530 0.126112
## `PSC CO2` -1.316e+00 6.129e-01 -2.147 0.031956 *
## `Mnf Flow` -6.295e-03 4.694e-04 -13.411 < 2e-16 ***
## `Carb Pressure1` 3.849e-01 4.561e-02 8.440 < 2e-16 ***
## `Hyd Pressure2` -1.432e-02 4.580e-03 -3.127 0.001795 **
## `Hyd Pressure3` 3.564e-02 5.596e-03 6.369 2.42e-10 ***
## `Filler Level` -9.309e-05 4.685e-05 -1.987 0.047094 *
## Temperature -4.777e+04 6.851e+03 -6.973 4.35e-12 ***
## `Usage cont` -3.253e-03 5.511e-04 -5.902 4.29e-09 ***
## `Carb Flow` 1.186e-05 3.473e-06 3.415 0.000652 ***
## Density -4.178e-01 2.615e-01 -1.598 0.110321
## Balling -1.967e+00 3.011e-01 -6.532 8.46e-11 ***
## `Pressure Vacuum` -2.065e-01 6.600e-02 -3.128 0.001787 **
## `Oxygen Filler` -2.204e-01 7.211e-02 -3.056 0.002276 **
## `Bowl Setpoint` 2.648e-04 4.986e-05 5.310 1.24e-07 ***
## `Pressure Setpoint` -5.927e+03 1.711e+03 -3.465 0.000543 ***
## `Air Pressurer` -1.147e+05 7.036e+04 -1.630 0.103310
## `Carb Rel` 1.454e+02 7.520e+01 1.933 0.053337 .
## `Balling Lvl` 9.780e-01 1.404e-01 6.965 4.60e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.098 on 1773 degrees of freedom
## Multiple R-squared: 0.444, Adjusted R-squared: 0.4362
## F-statistic: 56.64 on 25 and 1773 DF, p-value: < 2.2e-16
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 1.278301e+04 1.506653e+05
## `\\`Brand Code\\`B` 2.739016e-01 7.385804e-01
## `\\`Brand Code\\`C` -9.141735e-01 -4.160419e-01
## `\\`Brand Code\\`D` 3.264456e-01 7.437564e-01
## `Carb Volume` -2.670009e+02 2.072922e+01
## `Fill Ounces` -5.046880e-02 8.575203e-04
## `PC Volume` -1.341335e+00 5.433517e-03
## `Carb Temp` -1.113581e+03 9.022066e+03
## `PSC CO2` -2.517577e+00 -1.135914e-01
## `Mnf Flow` -7.216087e-03 -5.374781e-03
## `Carb Pressure1` 2.954619e-01 4.743594e-01
## `Hyd Pressure2` -2.330666e-02 -5.339455e-03
## `Hyd Pressure3` 2.466589e-02 4.661792e-02
## `Filler Level` -1.849890e-04 -1.196553e-06
## Temperature -6.121057e+04 -3.433703e+04
## `Usage cont` -4.333408e-03 -2.171747e-03
## `Carb Flow` 5.049152e-06 1.867102e-05
## Density -9.307547e-01 9.513230e-02
## Balling -2.557642e+00 -1.376377e+00
## `Pressure Vacuum` -3.359063e-01 -7.702190e-02
## `Oxygen Filler` -3.618069e-01 -7.894796e-02
## `Bowl Setpoint` 1.669546e-04 3.625515e-04
## `Pressure Setpoint` -9.282345e+03 -2.572205e+03
## `Air Pressurer` -2.526640e+05 2.331970e+04
## `Carb Rel` -2.092798e+00 2.928850e+02
## `Balling Lvl` 7.025916e-01 1.253336e+00
set.seed(100)
library(car)
plot(model1)
set.seed(100)
#install.packages("regclass")
library(regclass)
VIF(model1)
## `\\`Brand Code\\`B` `\\`Brand Code\\`C` `\\`Brand Code\\`D` `Carb Volume`
## 5.226957 2.541740 3.100344 3.811709
## `Fill Ounces` `PC Volume` `Carb Temp` `PSC CO2`
## 1.130737 1.405911 1.075614 1.024303
## `Mnf Flow` `Carb Pressure1` `Hyd Pressure2` `Hyd Pressure3`
## 4.714587 1.500022 8.393839 11.848889
## `Filler Level` Temperature `Usage cont` `Carb Flow`
## 8.484869 1.385065 1.661131 1.799201
## Density Balling `Pressure Vacuum` `Oxygen Filler`
## 10.203170 22.870040 2.144546 2.016515
## `Bowl Setpoint` `Pressure Setpoint` `Air Pressurer` `Carb Rel`
## 8.908272 1.535529 1.205666 5.203412
## `Balling Lvl`
## 22.142918
set.seed(100)
# Predict df_test and calculate performance
model1predict <- predict(model1, test)
# Merge the results into a data frame called results
results <- data.frame()
results <- data.frame(t(postResample(pred = model1predict, obs = test$PH))) %>%
mutate(Model = "Multiple Regression") %>%
rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 1.177533 | 0.3760906 | 0.9043881 | Multiple Regression |
#install.packages("glmnet")
library(glmnet)
library(doParallel)
cl2 <- makePSOCKcluster(5)
registerDoParallel(cl2)
set.seed(200)
# to find the right lambda using cv.glmnet
xtrain <- model.matrix(PH ~ ., data = train)
xtest <- model.matrix(PH ~ ., data = test)
cv.glmnet <- cv.glmnet(xtrain, train$PH, alpha = 0)
ridgemdl <- glmnet(xtrain, train$PH, alpha = 0, lambda = cv.glmnet$lambda.min)
stopCluster(cl2)
summary(ridgemdl)
## Length Class Mode
## a0 1 -none- numeric
## beta 35 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# Predict df_test and calculate performance
ridge <- predict(ridgemdl, xtest)
results <- data.frame(t(postResample(pred = ridge, obs = test$PH))) %>%
dplyr::mutate(Model = "Ridge Regression") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 1.174503 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.177533 | 0.3760906 | 0.9043881 | Multiple Regression |
cl4 <- makePSOCKcluster(5)
registerDoParallel(cl4)
set.seed(200)
# training the elastic net regression model using train
enet_model <- train(
PH ~ ., data = train, method = "glmnet",
trControl = trainControl("repeatedcv", repeats = 8),
tuneLength = 4
)
stopCluster(cl4)
summary(enet_model)
## Length Class Mode
## a0 97 -none- numeric
## beta 3298 dgCMatrix S4
## df 97 -none- numeric
## dim 2 -none- numeric
## lambda 97 -none- numeric
## dev.ratio 97 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
plot(enet_model)
# Make predictions
enet_pred <- predict(enet_model, test)
results <- data.frame(t(postResample(pred=enet_pred, obs=test$PH))) %>%
mutate(Model = "ElasticNet Regression") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 1.172259 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.174503 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.177533 | 0.3760906 | 0.9043881 | Multiple Regression |
cl5 <- makePSOCKcluster(5)
registerDoParallel(cl5)
set.seed(200)
# avvnet
nnetGrid <- expand.grid(.decay = c(0.1, 0.5), .size = c(1,10), .bag = FALSE)
nnet.mdl <- train(PH ~ ., data = train, method = "avNNet", preProcess = c("center",
"scale"), tuneGrid = nnetGrid, trControl = trainControl(method = "repeatedcv",
repeats = 1), trace = FALSE, linout = TRUE, maxit = 500)
stopCluster(cl5)
summary(nnet.mdl)
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 34 -none- character
## terms 3 terms call
## coefnames 34 -none- character
## xlevels 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 3 -none- list
plot(nnet.mdl)
# Predict df_test and calculate performance
nnetpred <- predict(nnet.mdl, newdata = test)
results <- data.frame(t(postResample(pred = nnetpred, obs = test$PH))) %>%
mutate(Model = "Neural Network (avNNet)") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
cl6 <- makePSOCKcluster(5)
registerDoParallel(cl6)
set.seed(200)
knnMdl <- train(PH ~ ., data = train, method = "knn", preProc = c("center","scale"), tuneLength = 10)
stopCluster(cl6)
summary(knnMdl)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
plot(knnMdl)
# Predict df_test and calculate performance
knn <- predict(knnMdl, newdata = test)
results <- data.frame(t(postResample(pred = knn, obs = test$PH))) %>%
mutate(Model = "k-Nearest Neighbors (kNN)") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 1.0598769 | 0.4952261 | 0.7785023 | k-Nearest Neighbors (kNN) |
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
#GBM
library(caret)
train1 <- train_bc %>% dplyr::select (-PH)
X.train <- train1[set, ]
y.train <- train_bc$PH[set]
X.test <- train1[-set, ]
y.test <- train_bc$PH[-set]
cl7 <- makePSOCKcluster(5)
registerDoParallel(cl7)
set.seed(200)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
interaction.depth = c(1, 5, 10, 15),
shrinkage = c(0.01, 0.1, 0.5),
n.minobsinnode = c(5, 10, 15))
gbm_Mdl <- train(x = X.train,
y = y.train,
method = "gbm",
tuneGrid = grid,
verbose = FALSE
)
stopCluster(cl7)
summary(gbm_Mdl )
## var rel.inf
## Mnf Flow Mnf Flow 17.50513869
## Oxygen Filler Oxygen Filler 5.21618716
## `Brand Code`C `Brand Code`C 5.15464656
## Usage cont Usage cont 4.82667539
## Carb Pressure1 Carb Pressure1 4.81167716
## Alch Rel Alch Rel 4.21118398
## Air Pressurer Air Pressurer 4.20008387
## Pressure Vacuum Pressure Vacuum 4.17488606
## Temperature Temperature 3.91465311
## Filler Speed Filler Speed 3.57973207
## Balling Lvl Balling Lvl 3.46059110
## Carb Flow Carb Flow 3.39213631
## Density Density 3.29697641
## Carb Rel Carb Rel 3.16671830
## Fill Ounces Fill Ounces 2.22907630
## Balling Balling 2.19179651
## PC Volume PC Volume 2.14110063
## Carb Volume Carb Volume 1.95651762
## Hyd Pressure4 Hyd Pressure4 1.92470429
## Hyd Pressure2 Hyd Pressure2 1.87409118
## PSC PSC 1.80071793
## Bowl Setpoint Bowl Setpoint 1.73877297
## Fill Pressure Fill Pressure 1.73621500
## Hyd Pressure3 Hyd Pressure3 1.73141879
## Carb Pressure Carb Pressure 1.69627016
## PSC Fill PSC Fill 1.68639033
## Filler Level Filler Level 1.67024046
## Carb Temp Carb Temp 1.54919810
## PSC CO2 PSC CO2 1.15678322
## `Brand Code`B `Brand Code`B 0.68534995
## Pressure Setpoint Pressure Setpoint 0.64945532
## `Brand Code`A `Brand Code`A 0.37706909
## `Brand Code`Unknown `Brand Code`Unknown 0.22264230
## `Brand Code`D `Brand Code`D 0.07090368
plot(gbm_Mdl )
gbm_Mdl$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 88 200 15 0.1 5
gbm_Mdl$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 34 predictors of which 34 had non-zero influence.
# Predict df_test and calculate performance
gbm <- predict(gbm_Mdl, newdata = test)
results <- data.frame(t(postResample(pred = gbm, obs = test$PH))) %>% mutate(Model = "Generalized Boosted Models") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 0.8928378 | 0.6405441 | 0.6664880 | Generalized Boosted Models |
| 1.0598769 | 0.4952261 | 0.7785023 | k-Nearest Neighbors (kNN) |
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
options(max.print = 1e+06)
cl8 <- makePSOCKcluster(5)
registerDoParallel(cl8)
set.seed(200)
mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
mars.model <- train(x = X.train, y = y.train, method = "earth", tuneGrid = mars.grid,
preProcess = c("center", "scale"), tuneLength = 10)
stopCluster(cl8)
summary(mars.model)
## Call: earth(x=data.frame[1799,34], y=c(34.44,33.61,3...), keepxy=TRUE,
## degree=2, nprune=15)
##
## coefficients
## (Intercept) 34.505258
## h(-0.187665-Mnf Flow) 2.268911
## h(1.55802-Temperature) 0.381931
## h(Balling-0.666758) 1.813780
## `Brand Code`C * h(Hyd Pressure3-0.713986)          -0.934977
## `Brand Code`C * h(0.713986-Hyd Pressure3)          -0.316747
## h(-0.187665-Mnf Flow) * h(1.52742-Carb Pressure1) -0.285200
## h(-0.187665-Mnf Flow) * h(Pressure Vacuum-0.366229) -0.550243
## h(-0.187665-Mnf Flow) * h(0.366229-Pressure Vacuum) -0.540903
## h(-0.187665-Mnf Flow) * h(Air Pressurer-0.823362) -0.505415
## h(Mnf Flow- -0.187665) * h(Air Pressurer-0.32098) 0.442730
## h(1.55802-Temperature) * h(Usage cont-0.231278) -0.329434
## h(Density-1.06311) * h(Oxygen Filler- -1.05263) -1.181989
## h(0.666758-Balling) * h(Bowl Setpoint- -1.34819) 0.304218
## h(Balling-0.666758) * h(1.62844-Alch Rel) -1.023213
##
## Selected 15 of 32 terms, and 13 of 34 predictors (nprune=15)
## Termination condition: RSq changed by less than 0.001 at 32 terms
## Importance: `Mnf Flow`, `Brand Code`C, `Hyd Pressure3`, Balling, ...
## Number of terms at each degree of interaction: 1 3 11
## GCV 1.104934 RSS 1909.016 GRSq 0.4836855 RSq 0.5035911
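The summary above lists the 13 surviving predictors; a short sketch of ranking them with caret's `varImp()` (mirroring the Importance line in the earth output) could look like this, assuming `mars.model` from the chunk above:

```r
# Sketch: rank and plot the predictors retained by the MARS model.
# varImp() on a caret train object returns scaled importance scores.
plot(varImp(mars.model), top = 10)
```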
# Predict df_test and calculate performance
mars <- predict(mars.model, newdata = X.test)
results <- data.frame(t(postResample(pred = mars, obs = y.test))) %>% mutate(Model = "Multivariate Adaptive Regression Splines (MARS)") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 1.1203560 | 0.4367615 | 0.8620190 | Multivariate Adaptive Regression Splines (MARS) |
| 0.8928378 | 0.6405441 | 0.6664880 | Generalized Boosted Models |
| 1.0598769 | 0.4952261 | 0.7785023 | k-Nearest Neighbors (kNN) |
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
cl9 <- makePSOCKcluster(5)
registerDoParallel(cl9)
set.seed(200)
cubist_Model <- train(x = X.train, y = y.train, method = "cubist")
stopCluster(cl9)
# Predict df_test and calculate performance
Cubist <- predict(cubist_Model, newdata = X.test)
results <- data.frame(t(postResample(pred = Cubist, obs = y.test))) %>% mutate(Model = "Cubist Model") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 0.9116417 | 0.6287763 | 0.6811602 | Cubist Model |
| 1.1203560 | 0.4367615 | 0.8620190 | Multivariate Adaptive Regression Splines (MARS) |
| 0.8928378 | 0.6405441 | 0.6664880 | Generalized Boosted Models |
| 1.0598769 | 0.4952261 | 0.7785023 | k-Nearest Neighbors (kNN) |
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
set.seed(100)
library(randomForest)
rfModel <- randomForest(X.train, y.train,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, X.test)
#postResample(rfPred, y.test)
results <- data.frame(t(postResample(pred = rfPred, obs = y.test))) %>% mutate(Model = "Random Forest") %>% rbind(results)
knitr::kable(results)
| RMSE | Rsquared | MAE | Model |
|---|---|---|---|
| 0.8498634 | 0.6907792 | 0.6238147 | Random Forest |
| 0.9116417 | 0.6287763 | 0.6811602 | Cubist Model |
| 1.1203560 | 0.4367615 | 0.8620190 | Multivariate Adaptive Regression Splines (MARS) |
| 0.8928378 | 0.6405441 | 0.6664880 | Generalized Boosted Models |
| 1.0598769 | 0.4952261 | 0.7785023 | k-Nearest Neighbors (kNN) |
| 0.9734992 | 0.5752473 | 0.7160752 | Neural Network (avNNet) |
| 1.1722589 | 0.3812287 | 0.9038890 | ElasticNet Regression |
| 1.1745032 | 0.3779246 | 0.9089871 | Ridge Regression |
| 1.1775326 | 0.3760906 | 0.9043881 | Multiple Regression |
rfImp <- varImp(rfModel, scale = TRUE) %>%
as.data.frame()
rfImp %>%
arrange(-Overall) %>%
kable() %>%
kable_styling() %>%
scroll_box()
| | Overall |
|---|---|
| Mnf Flow | 55.4294548 |
| Brand CodeC | 48.5689924 |
| Oxygen Filler | 48.4671185 |
| Pressure Vacuum | 47.2839766 |
| Temperature | 39.0917386 |
| Air Pressurer | 38.5935464 |
| Usage cont | 36.8814501 |
| Filler Speed | 36.5237251 |
| Carb Rel | 35.8996462 |
| Balling Lvl | 33.9379320 |
| Alch Rel | 32.7328034 |
| Carb Flow | 32.2676163 |
| Balling | 31.1178814 |
| Carb Pressure1 | 30.7906984 |
| Density | 27.5308937 |
| Bowl Setpoint | 25.1728767 |
| Hyd Pressure3 | 23.1533251 |
| Filler Level | 23.1506157 |
| Brand CodeB | 21.3900442 |
| Carb Volume | 20.8766737 |
| PC Volume | 18.8554462 |
| Hyd Pressure2 | 18.6143465 |
| Fill Pressure | 16.8877679 |
| Hyd Pressure4 | 16.8676225 |
| Brand CodeD | 16.6718170 |
| Brand CodeA | 14.1870874 |
| Pressure Setpoint | 14.0282006 |
| Brand CodeUnknown | 13.0502763 |
| Fill Ounces | 5.7929613 |
| Carb Pressure | 4.4967456 |
| Carb Temp | 3.1029152 |
| PSC Fill | 1.7036334 |
| PSC CO2 | 0.2873638 |
| PSC | -0.7007273 |
varImpPlot(rfModel, sort = TRUE, n.var = 10)
top10 <- varImp(rfModel) %>%
filter(Overall < 57) %>%
arrange(-Overall) %>%
head(10)
cleantrain %>%
dplyr::select(c("PH", row.names(top10))) %>%
cor()
## PH Mnf Flow `Brand Code`C Oxygen Filler
## PH 1.00000000 -0.446975100 -0.280418076 0.171402067
## Mnf Flow -0.44697510 1.000000000 -0.004129172 -0.537695437
## `Brand Code`C -0.28041808 -0.004129172 1.000000000 0.045039296
## Oxygen Filler 0.17140207 -0.537695437 0.045039296 1.000000000
## Pressure Vacuum 0.22053722 -0.527885581 -0.060622215 0.227311133
## Temperature -0.14731541 -0.092651655 0.202061285 0.148297951
## Air Pressurer -0.01365873 -0.049374969 -0.020411142 0.098706380
## Usage cont -0.31699873 0.521310419 0.005411482 -0.310887324
## Filler Speed -0.02157477 0.150580883 -0.003874677 -0.132903757
## Carb Rel 0.16169918 -0.030561658 -0.240218983 -0.007825912
## Balling Lvl 0.09996422 0.036036226 -0.223972463 -0.055573624
## Pressure Vacuum Temperature Air Pressurer Usage cont
## PH 0.220537222 -0.14731541 -0.01365873 -0.316998730
## Mnf Flow -0.527885581 -0.09265165 -0.04937497 0.521310419
## `Brand Code`C -0.060622215 0.20206128 -0.02041114 0.005411482
## Oxygen Filler 0.227311133 0.14829795 0.09870638 -0.310887324
## Pressure Vacuum 1.000000000 0.04683150 0.16817638 -0.303452067
## Temperature 0.046831501 1.00000000 0.05603934 -0.115017934
## Air Pressurer 0.168176380 0.05603934 1.00000000 -0.094720140
## Usage cont -0.303452067 -0.11501793 -0.09472014 1.000000000
## Filler Speed 0.015564308 -0.37782777 -0.01588814 0.075729750
## Carb Rel -0.003370213 -0.09416766 -0.10101035 -0.030865283
## Balling Lvl -0.043948462 -0.17841637 -0.08642802 0.042250201
## Filler Speed Carb Rel Balling Lvl
## PH -0.021574773 0.161699183 0.09996422
## Mnf Flow 0.150580883 -0.030561658 0.03603623
## `Brand Code`C -0.003874677 -0.240218983 -0.22397246
## Oxygen Filler -0.132903757 -0.007825912 -0.05557362
## Pressure Vacuum 0.015564308 -0.003370213 -0.04394846
## Temperature -0.377827773 -0.094167660 -0.17841637
## Air Pressurer -0.015888136 -0.101010352 -0.08642802
## Usage cont 0.075729750 -0.030865283 0.04225020
## Filler Speed 1.000000000 -0.052265643 -0.01494369
## Carb Rel -0.052265643 1.000000000 0.84348310
## Balling Lvl -0.014943687 0.843483101 1.00000000
#%>% corrplot() + title("Correlation between PH and the Top 10 Numerical Variables")
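The commented-out line above hints at a correlation heatmap; a hedged sketch of that idea (assuming `cleantrain` and `top10` from the chunks above, and that the `corrplot` package is installed) could look like this:

```r
# Sketch: visualize the correlation matrix of PH and the top-10 predictors
# as a color heatmap instead of the raw console printout.
library(corrplot)
cleantrain %>%
  dplyr::select(c("PH", row.names(top10))) %>%
  cor() %>%
  corrplot(method = "color", tl.cex = 0.7,
           title = "Correlation between PH and the Top 10 Variables",
           mar = c(0, 0, 1, 0))
```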
stat
## # A tibble: 4 × 10
## `Brand Code` Min Q1 Median Mean Q3 Max StdDev Count Missing
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 A 8.06 8.42 8.52 8.50 8.6 8.86 0.163 293 0
## 2 B 8 8.44 8.56 8.57 8.7 8.94 0.169 1239 4
## 3 C 7.88 8.3 8.42 8.41 8.54 9.36 0.177 304 0
## 4 D 8.16 8.5 8.62 8.60 8.7 8.92 0.135 615 0
set.seed(100)
library(mice)
#install.packages("writexl")
library(writexl)
#head(test_bc)
#test_predict <- predict(cubist_Model , newdata=cleaneval)
#head(test_predict)
eval_data <- cleaneval %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))
ph_prediction <- predict(cubist_Model,eval_data)
eval_data$ph <- ph_prediction
head(eval_data)
## `Brand Code`A `Brand Code`B `Brand Code`C `Brand Code`D `Brand Code`Unknown
## 1 0 0 0 1 0
## 2 1 0 0 0 0
## 3 0 1 0 0 0
## 4 0 1 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## Carb Volume Fill Ounces PC Volume Carb Pressure Carb Temp PSC PSC Fill
## 1 5.480000 24.03333 0.2700000 65.4 134.6 0.236 0.40
## 2 5.393333 23.95333 0.2266667 63.2 135.0 0.042 0.22
## 3 5.293333 23.92000 0.3033333 66.4 140.4 0.068 0.10
## 4 5.266667 23.94000 0.1860000 64.8 139.0 0.004 0.20
## 5 5.406667 24.20000 0.1600000 69.4 142.2 0.040 0.30
## 6 5.286667 24.10667 0.2120000 73.4 147.2 0.078 0.22
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure2 Hyd Pressure3
## 1 0.04 -100 116.6 46.0 0 0
## 2 0.08 -100 118.8 46.2 0 0
## 3 0.02 -100 120.2 45.8 0 0
## 4 0.02 -100 124.8 40.0 0 0
## 5 0.06 -100 115.0 51.4 0 0
## 6 0.04 -100 118.6 46.4 0 0
## Hyd Pressure4 Filler Level Filler Speed Temperature Usage cont Carb Flow
## 1 96 129.4 3986 66.0 21.66 2950
## 2 112 120.0 4012 65.6 17.60 2916
## 3 98 119.4 4010 65.6 24.18 3056
## 4 132 120.2 3997 74.4 18.12 28
## 5 94 116.0 4018 66.4 21.32 3214
## 6 94 120.4 4010 66.6 18.00 3064
## Density Balling Pressure Vacuum Oxygen Filler Bowl Setpoint Pressure Setpoint
## 1 0.88 1.398 -3.8 0.022 130 45.2
## 2 1.50 2.942 -4.4 0.030 120 46.0
## 3 0.90 1.448 -4.2 0.046 120 46.0
## 4 0.74 1.056 -4.0 0.052 120 46.0
## 5 0.88 1.398 -4.0 0.082 120 50.0
## 6 0.84 1.298 -3.8 0.064 120 46.0
## Air Pressurer Alch Rel Carb Rel Balling Lvl ph
## 1 142.6 6.56 5.34 1.48 38.62435
## 2 147.2 7.14 5.58 3.04 38.84820
## 3 146.6 6.52 5.34 1.46 38.61751
## 4 146.4 6.48 5.50 1.48 38.63840
## 5 145.8 6.50 5.38 1.46 38.61926
## 6 146.0 6.50 5.42 1.44 38.63348
write_xlsx(eval_data, 'StudentEvaluation_final.xlsx')
hist(eval_data$ph)
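Since the assignment asks for an Excel file of pH predictions, a hedged sketch of a leaner deliverable (the file name and single-column layout are illustrative choices, assuming `eval_data` from the chunk above) might be:

```r
# Sketch: export only the predicted pH values for the business deliverable,
# rather than the full feature matrix.
write_xlsx(data.frame(PH_predicted = eval_data$ph),
           'PH_predictions.xlsx')
```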